There are many tasks and steps we work through in our everyday data science work, ranging from something as small as loading files in different configuration environments to creating very specific, repeated data visualizations for comparisons. Since we have deadlines to meet, we should be able to obtain usable results in a reasonable amount of time. Yet I have noticed that in some situations we tend to fight with our tools, which in my opinion is the opposite of how it should be. The tools were made by people in situations similar to ours, to solve a particular set of problems they had. When we face a similar problem, we should reach for those tools, rather than building our own or forcing a tool designed for situation A into situation B and then getting frustrated when the process isn't smooth.

My goal with this exercise is to use common data science tasks and idioms as examples to show off features of tools: not only the tools that may be new to you because they are designed for a niche, specific need (which still comes up quite often), but also our bread-and-butter tools, which have hidden features that can make our lives significantly easier.

The data

Let's start by quickly going over the two data sets we will use as recurring examples, because of course we need some data to work with. Both datasets I've chosen are extremely common, standard examples in data science tutorials, so hopefully everyone is already familiar with them. That way we don't need to spend time going over the specifics of the data, and can instead focus on the methods and techniques I will present.

The first data set is the stereotypical Iris data set, which consists of four numeric features of flowers, covering three species with 50 data points per species. Just for convenience later on, I have encoded the categorical species column as an integer.

The second data set is the cars data set, which contains data on roughly 400 cars from the 70s. It has numeric variables about the cars, like horsepower and mileage, but also the country of origin and the name of each car. Some of the variables are integer-valued and some are floating point, and there is a date column, but notice it's always January the first. So despite the variable being encoded as a datetime, it doesn't actually carry fine-grained date information down to the specific day of the month; it's just a placeholder for the year.

There are a couple of tables we will use only in section 3.5 but we won't be doing any analysis on their contents so I have chosen to omit them from this section.

Timesaving libraries
- don't reinvent the wheel

Now that we have the data sets out of the way, let's look at some libraries that save us from writing tedious boilerplate.

pandas_profiling - seamless data diagnostics

One of the first things we do when we get a new data set is to just look at it, which we grandly call exploratory data analysis. Contrary to what a lot of people think when they're starting out, EDA is not a set of predefined steps, recipes, or a checklist you have to follow. The checklist approach helps initially, to make sure you don't miss the obvious things, but going through the obvious things is not the actual end goal of exploratory data analysis.

Despite this, the checklist part is almost always going to be very similar: you print some basic stats of the data set, check the number and percentage of missing values, see which data types are present, check the cardinalities of the categorical variables, and check the limits of the numeric ones.

These are the standard numeric steps you take in an EDA pass. You also make some standard visualizations: the pair plot, which is a grid of pairwise scatter plots to check correlations and relationship types (say, whether some nonlinear effect is going on), plus univariate distribution plots like histograms and box plots. This is all very standard fare, and I created a small motivating example for it: while these things are conceptually trivial, writing the code for them is super tedious, and it is going to be about 90% the same for every data set you will ever work with. I've noticed, at least in my own workflow, that whenever something is tedious I unconsciously train myself to turn off my brain at that moment, and that's when I tend to make mistakes.

So I found this package a few years ago, when I noticed this bad habit I was developing. I wanted a package that would do all the standard tedious stuff for me, so that I could just look at the statistics and the plots and try to tease out the interesting patterns, which is actually my job as a data scientist. My job is not to write code; I use code as a means to an end.

The package is called pandas_profiling, and it takes just two lines. There are some optional arguments you can look into in the docs to customize the report as you wish, but just these two lines give you a very detailed report covering all the standard things. Of course, since this is a generic function that works for any pandas DataFrame, it will not give you specialized information describing some very particular kind of data set you might have. But that's okay.

For those data sets you can write the two or three extra bits of custom code yourself.

Just with this I can see how many missing values there are, the size of the data set, the number of observations, and so on. pandas_profiling is even nice enough to raise some qualitative warnings. These aren't hard errors; since the tool is automatic and generic it cannot be 100% correct, but it hints to the data scientist that certain things might be worth double-checking manually, which is super useful in my opinion. Whenever you get a new data set, you can just run this line of code. It takes a little while, because it has to do all the calculations, but think how much time it would take you to write that code manually, plus the cumulative time for all those bits of code to run; it's not that different. So just let it run, come back in a few minutes, and you'll have your report. The interesting thing here is that you get all the details per variable. You also get the interaction effects, which is what you wanted from the pair plots, plus the correlation heatmap and the missing-value overview. This last one is interesting: there is a heatmap for missing values. It's not that interesting to look at for the cars data set, though.

But you may have come across some literature about missing values and how to properly handle them, rather than blindly removing them or filling them with some value or zero. To handle them properly, you need to figure out what kind of missingness you have. Is the data missing completely at random, in which case you can drop the rows or fill them in with a central value? Or is the missingness correlated with some other variable? For example, suppose you have a data set for classification with some missing values, and it happens that half your missing values come from a single class.

That means there is something wrong with the way you have collected the data, and you have to account for it somehow. You can't just blindly remove those data points or fill them in with a sentinel value; if you do, your classifier will learn a completely wrong pattern. A heatmap of missing values is one interesting tool you can use to figure out whether something like this is happening in your data. And again, pandas_profiling makes it super seamless, which is why I like it.

Disclaimer: you need to be wary of JavaScript-based packages inside the Jupyter notebook. If you send too much data to the JS frontend, your kernel will crash or freeze. Some libraries warn you about this.

yellowbrick - somebody else has already done it better

The second tool I want to show is yellowbrick. The motivating example for this one is super simple, at least in my eyes, and I hope it seems as useful to you as well. We do not implement our own machine learning models, ever. Someone with formal statistics or machine learning training usually starts out by implementing their own logistic regression or their own neural network from scratch, without using the vast libraries available, but in practice we never do that, because it's a complete waste of time. Somebody else has already built and tested the k-means implementation in scikit-learn; in reality, thousands of people have done that for you.

Making it from scratch, when you have no specialized need, is a complete waste of your time and your employer's, which is why there is an unsaid understanding that you just use the implementation from whatever library is available: scikit-learn, statsmodels if you have some other statistical need, any of the many deep learning libraries. You never implement your own classifiers. But when it comes to other standardized things, like checking the performance of a machine learning classifier, we always try to make the visualizations ourselves, which doesn't make sense to me. Unless you have a very specific business goal, in which case you should be defining not only the visualization but also the quantitative metric yourself, in terms of KPIs you discuss with your product or business team, the analysis of a general machine learning model is 90% standard.

Just like with pandas_profiling, there is a package for this called yellowbrick, which contains a set of visualizations, not only for model analysis (the example I'm using here) but also for exploratory data analysis and model diagnostics. I see no reason to implement these visualizations myself when somebody else has made them for me, and made them configurable to a very nice degree. For example, there is a very nice decision boundary visualizer in yellowbrick. I created three models: a logistic regression, a decision tree, and a decision tree with a max depth of three, so an essentially regularized tree. I want to see how these three models differ in their classifications, and with the decision boundary visualization I can see clearly that the unconstrained decision tree is overfitting, and when I put the cap on it, I can see visually, very clearly, how it differs.
I can also see the difference between the logistic regression and a decision tree: the tree essentially creates orthogonal cuts, which makes sense if you know how it works, as opposed to the smooth boundary of a logistic regression. This is very apparent here. If you tried to reimplement this visualization it would be non-trivial; it's one of the comparatively more complex ones, but here it's just three straightforward lines. You can then spend your very expensive time (think of it from your employer's point of view) on actually understanding and explaining the model, rather than fighting with the code itself.

pandas tricks
- take full advantage of what you already have

The next section is about pandas. I'm willing to hazard a guess that for anyone doing data science in Python, pandas is the single most used library, in the sense that any codebase that claims to be data science will interact with pandas the most.

Since so many people use pandas, and pandas is arguably very easy to use in the sense that you can do a lot of interesting stuff very quickly by adding on bits and pieces, it's very easy to start developing bad habits in how you write code. This is aggravated by the fact that, visually and structurally, pandas code is very different from the normal Python code you would write if you were, say, building a web application in Flask.

code is communication

The number one, I would even say criminal, offense we commit when writing pandas code is to write code like this.

This should look very familiar: I have a DataFrame, I apply a filter, I do a group-by, and I calculate some aggregate statistic, in this case the mean per origin. To a certain degree this is easy to read, but only for people who have been trained to read this horrible kind of one-liner. It is a very simple operation, and it takes up basically half the cell's width. Line width isn't the real problem, though. The problem is that for me to figure out exactly what this line does, I have to read the entire thing left to right, and there is no visual distinction between its independent components, and there are independent components. So I'm going to propose an alternate way to write this line which is syntactically identical; it just looks different to the human being reading it, because at the end of the day, code is communication: between you and your colleagues, and between you and your future self.
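For reference, the kind of one-liner I mean looks roughly like this (a reconstruction on a toy cars-like DataFrame, not the exact line from the notebook):

```python
import pandas as pd

# stand-in for the cars data set; columns are assumptions
cars = pd.DataFrame({
    "mpg": [18.0, 24.0, 26.0, 14.0, 31.0],
    "horsepower": [130, 95, 113, 150, 67],
    "origin": ["usa", "japan", "europe", "usa", "japan"],
})

# everything crammed into one line: filter, group, aggregate
result = cars[cars["horsepower"] > 90].groupby("origin")[["mpg", "horsepower"]].mean()
print(result)
```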

Just by making one simple change, adding parentheses on both ends, you can put each individual step on its own line. I start with the cars DataFrame, I do a filter using the square bracket notation, I group by origin, and then I aggregate using the mean function. This is straightforward, and I can clearly see the three computation steps. I can also toggle steps on and off: if I want to check how the mean and the median compare, I can switch between the two very quickly. Most good code editors also have a keyboard shortcut for swapping lines, so if I want to change the order of the steps I can do that too. In this example reordering doesn't really make sense, but it is useful for more complex data processing pipelines, and that's the key word: this visually looks like a pipeline, a series of steps that you go through one after the other, in sequential order. Whenever people talk about data processing they use the word pipeline, so why not make the code actually look like a pipeline, so the communication carries the intuition.
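The same computation wrapped in parentheses, one step per line (same toy DataFrame as before, so the column names are assumptions):

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [18.0, 24.0, 26.0, 14.0, 31.0],
    "horsepower": [130, 95, 113, 150, 67],
    "origin": ["usa", "japan", "europe", "usa", "japan"],
})

# syntactically identical to the one-liner, but each step is visible
result = (
    cars
    [cars["horsepower"] > 90]   # filter
    .groupby("origin")          # group
    [["mpg", "horsepower"]]
    .mean()                     # aggregate -- swap for .median() to compare
)
print(result)
```

Commenting out the `.mean()` line or swapping it for `.median()` now touches exactly one line of code.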

pandas in pipes

Let me show a more realistic example. These are just two helper functions. The first is a mode function: it returns the most common value in a pandas Series.

The other function keeps the rows belonging to the most common values of a column; it's easier to explain with an example.

So let's look at this data processing segment. I take the DataFrame and filter it to just the cars built in 1971. I then create a new column, the manufacturer: if you remember, the DataFrame has the car name, and the first word of every car name is the company name, so I extract it into a separate column. Then I keep only the data points belonging to the 10 most common car manufacturers; that's what the helper function I wrote does, and its details aren't that important. Next, I group by the origin country and the manufacturing company and calculate some aggregate statistics. These statistics are different for each column, and I'm even renaming the resulting columns after the fact. As for the droplevel call, I'll show you what it does by removing it for a second: since I'm renaming the resulting columns, the output carries both the old column names and the new ones, and I don't care about the old names, so I just drop that level. I will explain this MultiIndex business in the next subsection, so bear with it for now. So this is my result.

On the surface this seems like a very reasonable bit of code. But notice that syntactically each step looks very different. Creating a new column looks very different from filtering. The helper function is in a completely different format: it takes in a DataFrame and returns a new one, whereas elsewhere I'm modifying the existing data in place. This kind of structural inconsistency builds up, and when bugs arise, it is not straightforward to identify what exactly happened.

So here is my proposal: transform the code like this. On the surface it looks denser and more complicated, but that's only because the previous example had a lot of whitespace and comments separating it; otherwise there is no difference, it's the exact same code, just structured differently. We start with the DataFrame, then there's a filter, then the assign method which creates the new column (the same code as before, just inside a lambda). Then I use the pipe method to pass the DataFrame into the helper function and take its return value, then the group-by and agg, and I get the same result. To me this is far more intuitive: I can clearly see the steps that went into producing this single report at the end.
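A sketch of the piped version on synthetic data; `keep_top_n` is a stand-in for the helper function described above, and the column names and numbers are assumptions:

```python
import pandas as pd

cars = pd.DataFrame({
    "name": ["ford pinto", "ford maverick", "toyota corolla", "fiat 124", "toyota corona"],
    "mpg": [19.0, 18.0, 31.0, 27.0, 25.0],
    "origin": ["usa", "usa", "japan", "europe", "japan"],
})

def keep_top_n(df, column, n):
    """Keep only the rows belonging to the n most common values of `column`."""
    top = df[column].value_counts().head(n).index
    return df[df[column].isin(top)]

report = (
    cars
    .assign(manufacturer=lambda d: d["name"].str.split().str[0])  # new column
    .pipe(keep_top_n, column="manufacturer", n=2)                 # custom step slots right in
    .groupby(["origin", "manufacturer"])
    .agg(avg_mpg=("mpg", "mean"), n_cars=("mpg", "size"))         # named aggregation
)
print(report)
```

Every step, built-in or custom, now reads the same way, top to bottom.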

And now I can put this whole pipeline into a function: it just takes a DataFrame and returns this new DataFrame, the report, at the end. That, to me, is how to take maximum advantage of the features pandas gives you.

multiindex is your friend

The index is the thing on the left side of the DataFrame, which in the default case is just the row numbers, and in the full data frame looks like this. In a pandas DataFrame there are actually two indexes: the row index, which we usually just call the index and which you can get with df.index, and the column index, which you can get with df.columns.

Then you get into multi-indexes, which are indexes consisting of more than one variable. You get those, for example, when you group by more than one column.

So you can see that we have a MultiIndex on the rows and a single index on the columns; let's focus on the row index.

If I get the index now, instead of just a list of values as before, it's a list of tuples, and it makes sense how that maps to what we see visually.
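A minimal sketch of how a two-column group-by produces a MultiIndex of tuples (toy data, assumed column names):

```python
import pandas as pd

cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "europe"],
    "cylinders": [8, 4, 4, 4],
    "mpg": [14.0, 28.0, 31.0, 26.0],
})

grouped = cars.groupby(["origin", "cylinders"]).mean()
print(grouped.index)     # a MultiIndex: effectively a list of (origin, cylinders) tuples
print(grouped.index[0])  # a single entry is a tuple
```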

Earlier I did an aggregation on all columns, so I got a DataFrame in return. What would I get if I aggregate a single column? I should get a Series, right?

So this is a multi-index Series, instead of a multi-index DataFrame.

If I get the index of this, it is identical to the index I got before. This should give you the idea that indexes are their own separate entities inside DataFrames and Series, independent of the content they are attached to.

Understanding the index in pandas gives you a lot more control over how you do filtering, subsetting, access, and cross-comparison of DataFrames. A lot of the time I see people doing semi-complicated comparisons by writing nested for loops and comparing each element of one DataFrame against the other; you can do the same with a couple of index tricks. It's not completely trivial, because, as I said, the operation itself is semi-complicated, but if you do it using the indexes it's going to be significantly faster. And if you're using dedicated data science libraries, you should at least take the advantages they provide.

So let's see what basic operations we can do with an index.

I hope you are familiar with .loc and .iloc. As I understand it, loc stands for "location" and iloc stands for "integer location". To .loc you give the value you see visually, the human-readable representation of the index, while to .iloc you give zero-based integer positions. So .loc is the more interesting and human-readable of the two.

If I give .loc "Europe", it gives me all three rows corresponding to Europe, and not only the values in those rows but also the second level of the index.

If I give it a tuple, it gives a single number as the output, which is to be expected: remember, the index is a list of tuples.
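Both lookups sketched on a toy multi-indexed Series (the labels are assumptions standing in for the cars aggregation):

```python
import pandas as pd

s = pd.Series(
    [27.0, 31.0, 15.0, 29.0],
    index=pd.MultiIndex.from_tuples(
        [("europe", 4), ("japan", 4), ("usa", 8), ("usa", 4)],
        names=["origin", "cylinders"],
    ),
)

print(s.loc["europe"])    # all rows for europe, keeping the cylinders level
print(s.loc[("usa", 8)])  # a full tuple selects a single scalar value
```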

Now suppose I want the reverse of what .loc returned: all the rows with 4. Unfortunately I can't just pass 4 to .loc, because by default it matches against the first level. The syntax for this is a bit more cumbersome, but I think it's still easy to understand; you just need to know there is a method called xs.

If you've worked with NumPy or pandas before, you know that a lot of functions take the axis argument: you pass 0 for a row-wise operation and 1 for a column-wise one (in NumPy you can have higher axis values as well). Relatively recently, pandas made essentially all the functions that take the axis argument also accept string values: "index", which corresponds to 0, and "columns", which corresponds to 1. In my opinion that's far superior. The third argument is level, which is the level of the MultiIndex.

If I put level 0 here and say "Europe" instead of 4, it is equivalent to using the .loc command above, just with more syntax. So you can think of xs as the more general version of .loc.
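The xs sketch, on the same toy Series shape as before (labels are assumptions):

```python
import pandas as pd

s = pd.Series(
    [27.0, 31.0, 15.0, 29.0],
    index=pd.MultiIndex.from_tuples(
        [("europe", 4), ("japan", 4), ("usa", 8), ("usa", 4)],
        names=["origin", "cylinders"],
    ),
)

# filter on the *second* level, which plain .loc can't do directly
four_cyl = s.xs(4, axis="index", level="cylinders")
print(four_cyl)

# level 0 with a first-level key is equivalent to plain .loc["europe"]
europe = s.xs("europe", axis="index", level=0)
```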

The next thing you might want to do is swap the order of the levels. I mentioned that by default .loc uses the first level, so maybe I more commonly want to filter on the number of cylinders, and I want that to be the first level so I can use .loc, because it looks nicer. For that there is the swaplevel method, but if I use it on its own, I get this, which is probably not what you want 90% of the time. So you then call sort_index on it, and it automatically does the expected nested, two-level sort.
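A sketch of swaplevel followed by sort_index (toy multi-indexed Series, assumed labels):

```python
import pandas as pd

s = pd.Series(
    [27.0, 31.0, 15.0],
    index=pd.MultiIndex.from_tuples(
        [("europe", 4), ("japan", 4), ("usa", 8)],
        names=["origin", "cylinders"],
    ),
)

swapped = s.swaplevel()      # cylinders is now the first level, but rows keep the old order
tidy = swapped.sort_index()  # the nested two-level sort restores a readable layout

print(tidy.loc[4])           # now .loc can filter on cylinders directly
```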

Most pandas functions do the obvious thing most of the time, and when they don't, as in the case of swaplevel, there's usually a very simple way to get them to do what you want.

And now we're back to the droplevel I used earlier. It takes the axis and level arguments and simply drops that level. Notice that it did not regroup anything, because that's not what I told it to do; I only told it to drop the level. If I wanted to regroup, I would need another group-by and aggregation, which I can do, but it's not something I should expect from droplevel itself.
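The droplevel sketch; note it only removes labels, it does not re-aggregate the values:

```python
import pandas as pd

s = pd.Series(
    [27.0, 31.0, 15.0],
    index=pd.MultiIndex.from_tuples(
        [("europe", 4), ("japan", 4), ("usa", 8)],
        names=["origin", "cylinders"],
    ),
)

flat = s.droplevel(level="cylinders", axis="index")
print(flat.index)  # only the origin labels remain; the values are untouched
```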

pivot tables for every taste

The next topic is pivot tables, which are a very common operation for people in business intelligence and business analysis, less so for people doing modeling and the stereotypically data-scientist things. But if you want to do general-purpose analysis, pivot tables are your friends. I like them personally because they make it easier for me to communicate with colleagues who are not in a technical role like me: I can use the terminology of pivot tables to talk to basically anybody who has taken an Excel course, because they learned how to make pivot tables there, and we can have a very productive discussion.

The thing I want to bring to your attention is that pivot tables and multi-index group-bys are identical to each other, and I've created an example here. There are two functions, stack and unstack, which operate on multi-indexes: they either move a level of one index over to the other axis, or combine two indexes into one.
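A sketch of the equivalence: a pivot_table and a two-column group-by followed by unstack produce the same wide table (toy data, assumed columns):

```python
import pandas as pd

cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan"],
    "cylinders": [4, 8, 4, 6],
    "mpg": [28.0, 14.0, 31.0, 20.0],
})

# route 1: a pivot table
pivoted = cars.pivot_table(index="origin", columns="cylinders", values="mpg", aggfunc="mean")

# route 2: a multi-index group-by, then unstack one level onto the columns
stacked = cars.groupby(["origin", "cylinders"])["mpg"].mean().unstack("cylinders")

print(pivoted.equals(stacked))  # the two results are identical
```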

tidy-ness is a virtue

The examples are from Daniel Chen | Cleaning and Tidying Data in Pandas and the talk goes into more detail about the concepts and functions we will briefly cover in this section.

The concept we want to talk about here is a tidy data frame. Conceptually, a data frame is a table where each row is one observation and each column is one variable/field/attribute of that particular observation. That's how we learn about tabular data, especially if you come from a databases background. But you could also come up with examples of tables in databases which do not follow this structure, and I'm willing to bet that anyone who has done data science work on real-world data has come across data sets which don't follow this pattern.

Let's look at the example below, which is a data set of religious affiliation and income brackets. The values in the cells contain the number of people who have the income level in the column and follow the religion in the row.

This seems okay on the surface, so let's ignore the structure of the table for a second and talk about what information we have here. We have information about religion, about income bracket, and about the number of people. So we have three variables. Considering the definition of tidy data, we should therefore have three columns, where each row tells us a single religion and income level combination and the number of people belonging to that combination. That would be the tidy structure of this data, but the table we see does not follow it. Visually, this table is wider than the tidy table would be. It has more columns and fewer rows, so we call it a wide-form table.

This should look similar because in the previous section we discussed pivot tables which are by design wide form, because they are easier for human beings to get information out of. This is why we often get data in a wide form table. We usually restrict pivot tables to 2 or 3 variables because they are meant to be a human-consumable report, not a raw data store. However this exact fact makes this format problematic for general-purpose data analysis tools. These tools work best with long-form tables which will have the same structure no matter how many variables you include (this is not true for pivot tables).

There are two functions in pandas, melt and pivot_table, which convert from wide to long and vice versa respectively. Even though I do most of my work in pandas and Python, I prefer the API of the R functions gather and spread from the aptly named tidyr package.

Here I am converting our wide table into a long one using melt. In this particular case, since we want all the columns except our id_vars to be included in value_vars, we can skip that long list argument; the default value will take care of it for us.
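The melt call sketched on a small stand-in for the religion/income table (the counts here are made up):

```python
import pandas as pd

wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# every column except id_vars is melted by default, so value_vars can be omitted
long = wide.melt(id_vars="religion", var_name="income", value_name="count")
print(long)
```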

The resulting dataframe follows our definition of tidyness perfectly.

Data manipulation and reshaping is never as simple as calling one of the functions I've mentioned and being done with it. Here I clean the contents of the string columns and convert the income variable into a categorical (R users will recognize the factor data type), so I can enforce an ordering which will be followed whenever I create an aggregation or plot.
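A sketch of the ordered-categorical step, so aggregations and plots follow our specified order instead of alphabetical order (data and bracket labels are placeholders):

```python
import pandas as pd

long = pd.DataFrame({
    "religion": ["Agnostic", "Agnostic", "Atheist", "Atheist"],
    "income": ["$10-20k", "<$10k", "<$10k", "$10-20k"],
    "count": [34, 27, 12, 27],
})

order = ["<$10k", "$10-20k"]  # the ordering we want enforced
long["income"] = pd.Categorical(long["income"], categories=order, ordered=True)

totals = long.groupby("income", observed=True)["count"].sum().sort_index()
print(totals.index.tolist())  # follows our order, not alphabetical
```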

In this format the syntax for filtering will be consistent no matter how many features you include.

Conducting simple statistical aggregations is now also consistent irrespective of the number of variables you have. Notice how the sort_index at the end is not sorting alphabetically but following the order we specified above.

Most of the common data analysis libraries expect data in this format, so their functions and methods are designed to be easy to use with tidy data. I have seen people try to force a wide-form table through pandas and then complain that the tools are too complicated to use. In my opinion this is not only a misunderstanding of the tool, but also an underestimation of the importance of data preprocessing and cleaning, which is too often relegated to just df.fillna(0).

I want to emphasize again that there is no silver bullet for data manipulation. You have to use all the tools provided by your language of choice to convert the messy data you receive into something tidy, so your downstream EDA, analysis, and modelling steps are simpler.

these pandas love dates

Something people seem relatively less familiar with is the rich datetime functionality in pandas. Wes McKinney created pandas while he was working at an investment management fund, and time-series analysis is a fundamental part of a lot of quantitative finance (or so I've been told 😂).

Let's look at the year column from the cars table. Notice the dtype at the bottom: pandas is storing this column as a datetime type with precision up to nanoseconds.

This is obviously wasteful for this scenario because our column only contains the year as actual information and it would be sufficient to store it as an int or even better as an ordered category.

In most actual cases when you read a date from a file pandas will load it as a string (the dtype will be object). You can force pd.read_csv etc to convert columns to dates at load time by using the parse_dates argument. Or you can use this function to convert a string into the proper type.
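Both approaches sketched, parsing at load time versus converting after the fact (toy CSV contents):

```python
import io
import pandas as pd

csv = io.StringIO("name,year\nford pinto,1971-01-01\ntoyota corolla,1972-01-01\n")

# option 1: parse while reading the file
df = pd.read_csv(csv, parse_dates=["year"])
print(df["year"].dtype)  # a datetime64 dtype instead of object

# option 2: loaded as strings, convert afterwards
df2 = pd.DataFrame({"year": ["1971-01-01", "1972-01-01"]})
df2["year"] = pd.to_datetime(df2["year"])
```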

There are many features pandas provides you with this datatype but I will only cover one here. Consider this pivot table from section 2.4. I have slightly modified the example to use the dt accessor to extract just the year value as an integer and given that to groupby. I think the resulting pivot looks better than seeing full dates here.

Now let's recreate this using the resample function, which works like groupby but knows it's operating on a datetime column. The result will be basically identical except for the column labels; I've created a small function to clean those up for us.

Now you might ask why this is useful, since all we've done so far is get the same output with more code.

Let's consider another example where instead of grouping by each single year we want to group by two years together. So 1970 and 1971 should be in a single column and so on. You should try to think how you can do this in the simple implementation from section 2.4.

Resample was made to solve this exact task. I've made another function to nicely format the columns, but the only change in the analysis code is a single digit.
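The single-digit change, sketched on toy data ("YS" is the year-start frequency alias; the columns are assumptions):

```python
import pandas as pd

cars = pd.DataFrame({
    "year": pd.to_datetime(["1970-01-01", "1971-01-01", "1972-01-01", "1973-01-01"]),
    "mpg": [17.0, 19.0, 21.0, 23.0],
})

yearly = cars.resample("YS", on="year")["mpg"].mean()      # one bin per year
two_yearly = cars.resample("2YS", on="year")["mpg"].mean() # 1970-71 together, 1972-73 together

print(len(yearly), len(two_yearly))
```

Going from yearly to two-yearly bins is literally "YS" to "2YS"; try doing that with the plain `dt.year` group-by from section 2.4.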

honorable mentions:

python features
- there are reasons people liked using python even before it became popular for DS

Now on to the language features themselves. These are features of the Python language itself so I don't need to import any library or install anything to get the benefit. I want to bring your attention to two features. One is generators. The other is context managers. I'll give just one example of each but I want to clarify that these are general features of the language so if you're creative you can do a lot of different things with them.

Here I'm linking two talks from three of the most fantastic Python speakers, which helped me appreciate the language. The last talk in particular is my inspiration for the examples in this section (it goes into more depth, of course).

honorable mention: A simple progress bar package - tqdm. Install this and get in the habit of using it with slow loops.
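Getting into the habit costs one import and one wrapper call; a minimal sketch:

```python
import time
from tqdm import tqdm

results = []
# wrap any iterable in tqdm() to get a live progress bar with rate and ETA
for i in tqdm(range(5), desc="processing"):
    time.sleep(0.01)  # stand-in for slow work
    results.append(i * i)
```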

generators

Without going into the technical details of the iteration protocol, a generator makes something that you can loop over. This definition is enough for the example I want to show.

There is a folder with some files which contain some data that I've received. You can think of these as extractions of historical daily data, one day per file, and I have to do some analysis on each file and create a separate summary report for each day.

This file_loader function returns something called a generator object. As I mentioned all you need to know here is that this object is something you can loop over just like a list.

Here I'm looping over the generator (wrapped with tqdm for a progress bar), conducting my analysis inside the loop, and saving the result.

The resulting files can be seen here. The thing I want you to notice is that at any point in this loop only one file was ever loaded into memory. This means the total size of the directory doesn't matter: as long as I can work on one file at a time, I can eventually cover them all.

This is a very trivial example only meant to show you how generators work. Behind the scenes generators are a major part of the iterator protocol in python and form the foundations of modern concurrent programming in python through coroutines.

context managers

You will have seen these in the context of loading files from disk.
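The canonical with-open idiom: the file is guaranteed to be closed when the block exits, even if an exception is raised inside it.

```python
from pathlib import Path

Path("example.txt").write_text("col_a,col_b\n1,2\n")  # a throwaway file to read

with open("example.txt") as f:
    header = f.readline()

# by this point f.close() has been called for us automatically
assert f.closed
```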


Some of you may have seen them used to work with a database connection
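As a self-contained stand-in for a real database, here's a sketch using the standard library's sqlite3. One subtlety worth knowing: sqlite3's native with-support manages transactions, not closing, so contextlib.closing is the piece that guarantees the connection is closed however the block exits.

```python
import sqlite3
from contextlib import closing

# closing() turns any object with a .close() method into a context manager
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()
# conn.close() has been called here, success or failure
```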

A less common example (but more relevant to our work) is using context managers to control the style of plots. Look into the plt.rcParams dictionary for the properties you can change and their default values.
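A sketch of that plotting use case with matplotlib's built-in rc_context: every rcParam changed inside the block reverts automatically when it ends.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

default_grid = plt.rcParams["axes.grid"]

# override style defaults only for the plots drawn inside this block
with plt.rc_context({"axes.grid": not default_grid, "figure.figsize": (10, 4)}):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])

# outside the block every rcParam is back to its previous value
```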

You can also write your own context managers to separate out the messy details of a process that needs specific enter and exit steps which must always happen in that order. Here I'm using a trick with a generator and the contextmanager decorator to create a context manager which creates and deletes a temporary table called 'stocks' at the appropriate times.
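A sketch of that trick, using the standard library's sqlite3 so it is self-contained: the generator body runs up to the yield on enter, and the finally clause on exit, so the 'stocks' table is always cleaned up even if the block raises.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def temp_table(conn):
    # enter step: create the temporary table
    conn.execute("CREATE TABLE stocks (symbol TEXT, price REAL)")
    try:
        yield conn  # hand control back to the body of the with-block
    finally:
        # exit step: always drop the table, even on error
        conn.execute("DROP TABLE stocks")

conn = sqlite3.connect(":memory:")
with temp_table(conn) as c:
    c.execute("INSERT INTO stocks VALUES ('ACME', 12.3)")
# the table is gone here
```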

I know I have not done justice to these versatile features and I encourage you to watch the talk by James Powell (linked above) to see the flexibility these provide you.

efficient plotting
- there is no need to fight with the tools

grammar of graphics

Exploratory Data Analysis is a 1977 book by John Tukey. He coined the term data analysis 15 years before publishing this book. Think about that every time you think data science is some hip new field 😉. Tukey talked about the steps you should go through before making any claims or modelling assumptions about the data. He not only explained the value of these exercises but also invented a lot of the visual tools we use every day.

The Grammar of Graphics is a 1999 book by Leland Wilkinson. It proposes a way to think about the mechanics of making visualisations. If Tukey taught us about the content, Wilkinson taught us the language in which we should express that content. Instead of attempting to explain this concept myself fully I will point you to one of my favourite explanations by Jake VanderPlas - How to Think about Data Visualization.

drawing

I like this pyramid representation of the grammar because it reinforces the idea that each level builds on top of the previous one - it is an additive process. If you are a ggplot user, terms like 'aesthetics' and 'geometries' will be familiar to you. That's because the 'gg' in ggplot stands for 'grammar of graphics': it was designed to be a programming implementation of the concepts proposed by Wilkinson.

The fundamental python plotting library matplotlib was made much earlier than ggplot and it was designed to mimic MATLAB's plotting API. Irrespective of what your feelings are towards MATLAB, there are certain historical reasons why this choice was made. If you are curious about the history of the python data science ecosystem I encourage you to check out another Jake VanderPlas talk about this topic.

You might have heard people say that they prefer R over python because it has much better plotting capabilities. Not to add fuel to the R vs python fire but I have always found this argument a bit disingenuous because people compare ggplot to matplotlib when they should be either comparing base graphics in R to matplotlib or ggplot to seaborn in python.

The grammar is designed to work with tidy data (refer to section 3.5). Let's look at this table as an example.

This is an example of wide data. The builtin pandas plotting interface works extremely well with this kind of data.

But as I said, if we want to take advantage of the grammar, we have to get tidy first.
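Getting from wide to tidy is a single melt call. The column names below are my guesses at the shape of the cars table, but the pattern is the same for any wide table:

```python
import pandas as pd

# assumed wide layout: one column per origin country
wide = pd.DataFrame({
    "Year": [1970, 1971],
    "Japan": [2, 3],
    "USA": [5, 4],
})

# every (year, country) pair becomes its own row
tidy = wide.melt(id_vars="Year", var_name="origin", value_name="num_cars")
```

Once in this shape, each grammar mapping (x, y, hue, facet) can point at exactly one column.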

seaborn is awesome

In this section we will see how seaborn provides a grammar of graphics inspired plotting interface in python.

| variable type | relationship type | axes-level | figure-level (FacetGrid) |
| --- | --- | --- | --- |
| one numeric | raw | stripplot, swarmplot | catplot |
| one numeric | summary | boxplot, boxenplot, violinplot | catplot |
| one/two numeric | distribution | histplot | displot |
| two numeric | raw | scatterplot | relplot |
| two numeric | raw/summary | lineplot | relplot |
| two numeric | model | regplot | lmplot |
| one categorical and one numeric | summary | pointplot, barplot, countplot | catplot |


This looks unexpected. What's the shaded region?

In our data there are 3 values of num_cars (y axis variable) for each Year (x axis variable), and seaborn, by default, plots the mean as a line and the 95% confidence interval as the shading.

Take a look at the estimator and ci arguments (examples of their usage are on the lineplot page of the seaborn docs).

Specifying a variable for hue here removes the shading since there is only 1 data point in a year for each origin country so a confidence interval can't be calculated.

I've changed the function name here (refer to the table above). Since this is a more general figure-level function I have to tell it what kind of plot I want using the kind argument. The result is basically identical (as it should be).

We access the subplotting mechanism in this figure-level function through the row and col arguments.

This is my version of the raw matplotlib code which creates basically the same chart. Seaborn is handling all this and more (notice the legend) behind the scenes, providing us a clean interface inspired by the grammar of graphics.

disclaimer: the order of facets is sometimes different in both approaches

even more seaborn awesomeness

We can use the returned FacetGrid object to add more things to the structured plot. Here I'm adding a horizontal line on all charts showing the overall average. Notice that the call to map has removed the axis labels.

We can also create the FacetGrid object ourselves and make even the linechart manually using map_dataframe.

In addition to FacetGrid, which provides column and row faceting, there is also PairGrid (facets for all combinations of variable pairs) and JointGrid (for margin plots).

this philosophy pays off

Understanding the grammar of graphics is not only beneficial on a conceptual level for structuring and thinking about data visualizations, but it also provides a foothold on which you can stand to learn more tools.

The famous interactive plotting library plotly has a seaborn-inspired interface - plotly.express. With near identical syntax to the previous example here we have a fully interactive plot.

Another example is the relatively recent addition to the pyData ecosystem - altair. While the syntax is quite different from seaborn, you should be able to identify the important common bits.

This consistency in interface isn't limited to python. Of course you could jump to the original - ggplot in R. Or even leave programming entirely and use a tool like Tableau, which provides a drag-and-drop interface to the grammar.

Simple, custom interactivity

Another feature which I see criminally underused is ipython's interactive widgets. These come bundled with all installations of jupyter notebooks and work seamlessly. While not fancy-looking enough to replace proper dashboarding tools (dash, streamlit, panel etc.), they are a quick way to interactively interrogate your data.

In this example I have a function which makes a simple histogram. The arguments control the number of bins and whether to use a logarithmic scale or not.
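A sketch of that setup; the plotting function is usable on its own, and wiring it to widgets is one extra line (the interact import assumes ipywidgets, which ships with jupyter):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # so this also runs outside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.lognormal(size=1_000)  # stand-in univariate sample

def plot_hist(bins=20, log=False):
    # the two arguments we want to explore interactively
    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, log=log)
    return ax

# in a notebook, this one line turns the function into a mini dashboard:
# from ipywidgets import interact
# interact(plot_hist, bins=(5, 100, 5), log=False)
```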

I can make a very simple 'dashboard' to thoroughly explore this univariate distribution.

sklearn, the proper way

The scikit-learn package shouldn't need any introduction for an audience of data scientists. It is the machine learning library in python - it provides a clean, simple API for models which has been mimicked by many ML packages, even in other languages. In addition to providing a breadth of ML models for all kinds of tasks, it also comes with performance metrics, model selection schemes and methods to generate dummy/synthetic data.

The builtins in scikit-learn are more than enough for most data science exercises, but sometimes we have to create a custom 'pipeline' with multiple parts. I usually see these situations tackled with ad hoc functions that implement the bespoke logic. As soon as this path is taken, we lose interoperability with the other parts of scikit-learn. This interoperability is one of the key reasons scikit-learn is so good at what it does, so why should we have to lose it when our tasks become just a little more complex?

In this section we will see that we can keep the interoperability and implement our custom logic at the same time. Yes! In this case we can have our cake and eat it too.

Custom Transformer

The first example is to make a custom imputer (a transformer which fills missing values). We want to make an imputer which fits a normal distribution to the available data and samples from the fitted distribution to fill in the missing values.

Instead of making ad hoc functions, we are going to create a class which extends the default sklearn transformer. If you have ever used a sklearn transformer, you might remember that the key methods are fit and transform. We are simply going to take the logic from our ad hoc functions and move it into these two methods of the class.
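A minimal version of such an imputer might look like this (my own sketch, not the notebook's exact code):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NormalImputer(BaseEstimator, TransformerMixin):
    """Fill NaNs with draws from a normal fitted to the observed values."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # the 'learning' step: per-column mean and std, ignoring NaNs
        self.means_ = np.nanmean(X, axis=0)
        self.stds_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rng = np.random.default_rng(0)
        for j in range(X.shape[1]):
            mask = np.isnan(X[:, j])
            # sample from the fitted distribution for the missing slots
            X[mask, j] = rng.normal(self.means_[j], self.stds_[j], mask.sum())
        return X
```

Because it inherits from the sklearn base classes, it gets fit_transform for free and slots into Pipeline, ColumnTransformer, GridSearchCV and friends.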

disclaimer: Notice the 2 classes which we are inheriting from. There are important reasons (related to how sklearn is implemented) which force us to create all our custom classes like this.

These are some helpful functions to work with sklearn pipelines.

These are some relatively new features which improve the integration with pandas.

The Pipeline provides an interface to chain transformers (and even models) together. This is very useful when you have a particular sequence of transformations that needs to be applied to the data every time. You can create a pipeline for this sequence and use it just like a regular transformer (call fit and transform on it).

The ColumnTransformer can apply different preprocessors (of course including Pipeline) to different groups of columns. Here I'm using it to apply different preprocessing pipelines to numeric and categorical columns.

I have added my custom imputer to the numeric pipeline. I can use it seamlessly with other sklearn constructs (this can include GridSearchCV etc. as well) because I made it by extending the core sklearn classes.

Let's apply our full preprocessor to the cars data.

The pandas integration has improved a lot, but sklearn transformers still return numpy arrays, which means we lose the column names. I'm using one of my helper functions to bring them back.

Notice the 'Origin' column above. It's been encoded as integers instead of the original strings for use in an ML model but now it's very hard to know which number corresponds to which country. The second helper function extracts this information from the preprocessor. All transformers in sklearn have their own internal state which remembers information about the raw data.

Custom Model

The second example is to make a custom model. Now obviously I'm not advocating that you all start implementing your own versions of linear regression and kmeans and put them in production instead of the battle-tested versions from sklearn 😂. In practice the models we use to solve business problems are rarely vanilla ML algorithms. Usually they have some fallbacks/rules working on top of them to ensure we don't take some action based on a poor prediction from the model. We might also need to combine the predictions from several models and take a single action at the end. This is the scenario we are going to work with here.

I have two classifiers - a regression and a tree. I only want to predict when both classifiers agree; otherwise I want the model to return NaN. The logic is simple enough to be implemented in a short function, but again I am going to extend the core sklearn estimator class.
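A sketch of what that class might look like (class and attribute names are mine, not the notebook's):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

class AgreementClassifier(BaseEstimator, ClassifierMixin):
    """Predict only when both base classifiers agree; otherwise NaN."""

    def fit(self, X, y):
        # fit both base models on the same data
        self.lr_ = LogisticRegression(max_iter=1000).fit(X, y)
        self.tree_ = DecisionTreeClassifier(random_state=0).fit(X, y)
        return self

    def predict(self, X):
        a = self.lr_.predict(X)
        b = self.tree_.predict(X)
        # keep the prediction where they agree, NaN where they don't
        return np.where(a == b, a.astype(float), np.nan)
```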

Of course I can create and use the classifier just like every other sklearn model.

The predict method is applying our custom logic correctly.

We can seamlessly use this model with any other sklearn construct.

Full Pipeline

When I was explaining an sklearn pipeline I mentioned that it can contain a model as well as transformers. Let's see this in action.

Here I've recreated the column transformer construct from before, minus the categorical part. This is because the only categorical variable we have is our target and passing in the y values to a column transformer's fit causes some problems so I prefer to preprocess the target column separately.

Now we can combine our column transformer and classifier into a single 'full' pipeline.
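The shape of such a full pipeline, sketched on the iris data so it is runnable standalone (a scaler stands in for our preprocessor here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full = Pipeline([
    ("prep", StandardScaler()),                     # transformer step(s)
    ("model", LogisticRegression(max_iter=1000)),   # final estimator
])

X, y = load_iris(return_X_y=True)
full.fit(X, y)           # fit_transform on the scaler, then fit on the model
preds = full.predict(X)  # transform on the scaler, then predict on the model
```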

Fitting the pipeline will call fit appropriately on all its constituents in order (transformers and estimators both).

Asking the pipeline to predict something will first call transform on the transformers in sequence, then pass the result to predict on the classifier at the end.

There's much more to see